Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean) #1019

Merged
valerio-oai merged 1 commit into openai:main from abaybektursun:record/ar-selfgen-gptq-xsa-bigramhash3072
Mar 30, 2026

Conversation

Contributor

@abaybektursun commented Mar 28, 2026

Mechanistic Interpretability: For a deep-dive analysis of this model — including layer-by-layer quantization sensitivity, logit lens interpretability, and calibration measurements — see the companion blog post: Mechanistic Interpretability of PR #1019

Recreated from PR #728 at @valerio-oai's request for clarity. PR #728 originally submitted val-calibrated files; this PR contains only the AR self-generated calibration submission with clean history.

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112

val_bpb: 1.1147 (3-seed mean) | ~15.91 MB | 8×H100 SXM, 600s | No TTT

This submission uses only AR (autoregressive) self-generated calibration data. After training, the model autoregressively generates its own calibration tokens. No val data and no train data are accessed during quantization. The calibration study below is provided separately to help the community understand GPTQ calibration — it is not part of this submission.

SOTA (from our PR #549, 3-seed mean): 1.89002 nats. This run: 1.88218 nats. Delta: −0.0078 nats. Clears the 0.005-nat threshold.

Results (3-seed)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|---|---|---|---|---|---|
| 314 | 6,927 | 86.6 | 1.1151 | 1.8828 | 15,863,278 |
| 42 | 6,922 | 86.7 | 1.1144 | 1.8816 | 15,984,850 |
| 999 | 6,917 | 86.8 | 1.1148 | 1.8822 | 15,876,310 |
| **Mean** | | | **1.1147** | **1.8822** | |

Changes from Prior SOTA (our PR #549)

PR #549 scores 1.1194 BPB using GPTQ-lite + Legal TTT + Parallel Muon + BigramHash(1536) + XSA on last 4 layers. This submission makes three changes and drops TTT:

1. AR Self-Generated Full Hessian GPTQ

PR #549 used GPTQ-lite (diagonal Hessian approximation). We use Full Hessian GPTQ with Cholesky error compensation and column reordering — a strictly better quantizer.
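To make the difference from GPTQ-lite concrete, here is a minimal NumPy sketch of full-Hessian GPTQ with diagonal damping and Cholesky-based error compensation. Column reordering and blocking are omitted, and the per-row symmetric int6 scaling is an illustrative assumption, not necessarily the PR's exact scheme:

```python
import numpy as np

def quantize_int6(x, scale):
    # Symmetric int6 grid: integer levels in [-31, 31], scaled per row
    return np.clip(np.round(x / scale), -31, 31) * scale

def gptq_full_hessian(W, H, damp=0.01):
    """Minimal full-Hessian GPTQ sketch (no actorder / blocking).
    W: (rows, cols) weights; H: (cols, cols) Hessian X^T X from calibration."""
    W = W.copy().astype(np.float64)
    cols = W.shape[1]
    # Diagonal damping for numerical stability
    H = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper-triangular Cholesky factor of H^{-1} drives error compensation
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    scale = np.abs(W).max(axis=1, keepdims=True) / 31 + 1e-12  # per-row scales
    Q = np.zeros_like(W)
    for j in range(cols):
        q = quantize_int6(W[:, j], scale[:, 0])
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # Push this column's quantization error onto not-yet-quantized columns
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```

Unlike a diagonal-Hessian quantizer, the compensation step uses the full off-diagonal structure of H, which is why it needs real calibration activations rather than just per-column statistics.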

The calibration problem: prior Full Hessian GPTQ implementations (PRs #535, #569, #593, #609) calibrated on training data, ruled illegal after the 600s window. We solve this by having the model generate its own calibration data. After training completes, the model autoregressively generates 64 sequences of 2048 tokens (temperature=0.8, fixed seed). Hessians H = X^T X are collected from these self-generated sequences. No val data, no train data accessed during quantization.
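The self-calibration loop can be sketched in PyTorch as below. The model interface, hook placement, and `bos_id` are illustrative assumptions; only the 64-sequence × 2048-token, temperature-0.8, fixed-seed setup comes from the submission:

```python
import torch

@torch.no_grad()
def ar_selfgen_calibrate(model, n_seqs=64, seq_len=2048, temp=0.8, seed=314,
                         bos_id=0):
    """Sketch of AR self-generated calibration (interface names hypothetical).
    The trained model samples its own calibration tokens; Hessians
    H = X^T X are then accumulated from forward activations."""
    torch.manual_seed(seed)                     # fixed seed for reproducibility
    seqs = torch.full((n_seqs, 1), bos_id, dtype=torch.long)
    for _ in range(seq_len - 1):                # left-to-right sampling, no KV cache
        logits = model(seqs)[:, -1, :] / temp   # assumes model returns logits
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)
        seqs = torch.cat([seqs, nxt], dim=1)

    hessians = {}                               # per-layer H = X^T X
    def hook(name):
        def fn(module, inputs, output):
            X = inputs[0].reshape(-1, inputs[0].shape[-1]).float()
            hessians[name] = hessians.get(name, 0) + X.T @ X
        return fn
    handles = [m.register_forward_hook(hook(n))
               for n, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)]
    model(seqs)                                 # one pass to collect activations
    for h in handles:
        h.remove()
    return hessians
```

Because the only inputs are a fixed BOS token and the model's own samples, this stays legal even when train and val data are off-limits during the quantization window.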

2. BigramHash 3072 × 112 (up from 1536)

Lineage: our PR #549 (1536) → PR #609 (2048) → this run (3072 × dim=112). Fits under 16MB; going wider increased artifact pressure past the break-even point.
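For readers unfamiliar with the concept, a hedged sketch of a hashed bigram embedding lookup is below. The hash function and multiplier are illustrative assumptions; only the 3072 × 112 table shape comes from this PR:

```python
import numpy as np

def bigram_hash_embed(tokens, table, mult=1000003):
    """Sketch of a BigramHash lookup (hash scheme is an assumption).
    Each (prev, cur) token pair hashes into one row of the table."""
    n_buckets = table.shape[0]
    prev = np.concatenate([[0], tokens[:-1]])    # shift to form bigrams
    idx = (prev * mult + tokens) % n_buckets     # cheap multiplicative hash
    return table[idx]                            # (len(tokens), dim)

# 3072 buckets x 112 dims = 344,064 parameters before compression
table = np.random.default_rng(0).normal(size=(3072, 112)).astype(np.float32)
emb = bigram_hash_embed(np.array([5, 9, 9, 17]), table)
```

The appeal for parameter golf: a fixed-size table captures token-pair statistics at a cost that is independent of vocabulary size, with hash collisions as the trade-off.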

3. XSA on all 11 layers (up from last 4)

PR #549 applied XSA to the last 4 layers. Extending to all 11 layers forces cross-position information mixing from layer 0 at zero parameter cost. Source: PR #478 by @gowtham0992.

Dropped: TTT

PR #549 used Legal Score-First TTT for −0.0025 BPB. On this stack, TTT is neutral or negative (25 failed attempts across two stacks — see our PR #756). XSA-all already captures the inter-document context patterns that TTT was adapting to. The Full Hessian GPTQ improvement more than compensates for dropping TTT.

Quantization Pipeline

| Stage | BPB |
|---|---|
| Pre-quant (post-EMA) | 1.1354 |
| Post-GPTQ int6 roundtrip | 1.1377 (+0.0023 gap) |
| Post-GPTQ sliding (AR self-gen) | 1.1147 |

Architecture

| Component | Setting | Source |
|---|---|---|
| Layers | 11 (512d, 8 GQA / 4 KV heads) | Baseline |
| MLP | 3× (1536), LeakyReLU(0.5)² | #493 @parinzee |
| Attention | XSA on all 11 layers | #478 @gowtham0992 |
| BigramHash | 3072 × 112 | This work (concept: #162 @raahilshah) |
| RoPE | Partial (16/64 dims) | #315 @jfprincz |
| LN Scale | 1/√(layer+1) | #315 @jfprincz |
| VE128 | Layers 9-10 | #374 @unnir |
| SmearGate | Position-mixing gate | #65 @aquariouseworkman |
| U-Net skips | Encoder-decoder connections | #289 |
| Weight avg | EMA(0.997) + SWA(every 50) | #401 @newjordan |
| Quantization | Full Hessian GPTQ int6 (AR self-gen calibration) | This work (GPTQ: #535 @raahilshah) |
| Compression | LZMA preset=9 | #160 @ChaseWNorton |
| Warmdown | 4000 iterations | #364 @shikhar1729 |
| Optimizer | Parallel Muon | Our #399 |
| Late QAT | STE at LR scale < 0.15 | #286 @chris-buckley |
| Selective pruning | ±1 by reconstruction error | #609 @saml212 |
| Flash Attention 3 | Hopper kernels | #122 @mtybadger |

Run Command

```shell
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Community Reference: GPTQ Calibration Study

This section is not part of the submission. It documents our investigation into what calibration data GPTQ actually needs — shared here to help the community, since GPTQ calibration legality has been a recurring question in this competition (PRs #535, #569, #593, #609, #639).

The question

GPTQ calibration was the source of a legality dispute in this competition. PRs #593 and #609 used training data for calibration and were rejected or flagged. We initially used val data instead, which raised its own question: is val-data calibration legal? To answer this definitively, we investigated whether the model can calibrate itself with no external data at all — which is what the submission above does.

Single-checkpoint ablation

Same trained weights (seed 314), 5 calibration methods, no retraining. This ablation isolates calibration source on a single checkpoint.

| # | Calibration source | Tokens | Time | Sliding BPB | vs val-calib |
|---|---|---|---|---|---|
| 1 | Val data | ~50M | ~5s | 1.1145 | — |
| 2 | Autoregressive self-generation (used in submission) | 131K | 186s | 1.1148 | +0.0003 |
| 3 | Random tokens (64 batches) | 131K | 3.4s | 1.1165 | +0.0020 |
| 4 | Random tokens (256×48 batches) | 25M | 35s | 1.1165 | +0.0020 |
| 5 | Gibbs-refined (3 rounds) | 6.3M | 24s | 1.1166 | +0.0021 |

Confirmed on a second checkpoint (BigramHash 2048×128, 8×H100) with consistent relative gaps: val 1.11626, AR 1.11657, random 1.11816.

Val-calibrated 3-seed results (not submitted, reference only)

For comparison, the same stack with val-data GPTQ calibration instead of AR self-gen:

| Seed | Steps | ms/step | Pre-quant BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|---|---|---|---|---|---|---|
| 314 | 6,952 | 86.3 | 1.1340 | 1.1141 | 1.8813 | 15,855,088 |
| 42 | 6,952 | 86.3 | 1.1341 | 1.1142 | 1.8815 | 15,853,088 |
| 999 | 6,945 | 86.4 | 1.1343 | 1.1143 | 1.8817 | 15,866,156 |
| **Mean** | | | **1.1341** | **1.1142** | **1.8815** | |

AR self-gen is 0.0006 BPB worse than val-calibrated. Both clear the SOTA threshold.

Full quantization pipeline comparison

| Stage | BPB |
|---|---|
| Pre-quant (post-EMA) | 1.1354 |
| Post-GPTQ int6 roundtrip | 1.1377 (+0.0023 gap) |
| Post-GPTQ sliding (val-calib) | 1.1142 |
| Post-GPTQ sliding (AR self-gen) | 1.1147 |
| Post-GPTQ sliding (random self-gen) | 1.1165 |

Findings

  1. Autoregressive self-generation closes 84% of the val-vs-random gap (0.0017 of 0.0020 BPB). The gap between val-calibrated and random-token calibration is predominantly natural language vs random noise. Coherent text from the model's own distribution produces Hessians nearly identical to val data.

  2. The remaining 0.0003 BPB is P_model vs P_data divergence. The model's output distribution is a 27M-parameter approximation of the FineWeb data distribution. This small residual gap measures how far the model's internal activation patterns have drifted from those of real text. It is negligible.

  3. Gibbs refinement does not help (1.1166 vs 1.1165 for plain random). Gibbs replaces tokens in-place conditioned on still-mostly-random neighbors — it does not produce coherent text. Autoregressive generation builds coherent sequences left-to-right, which is what produces natural-language-like activations.

  4. More random tokens do not help. 131K and 25M tokens give identical BPB (1.1165). The Hessian converges quickly at int6 — it mainly needs to identify dead columns and relative importance, which are properties of the model's weights, not input statistics.

  5. Every calibration method beats SOTA. Even the worst (random tokens, 1.1165) beats the previous SOTA (our PR #549, 1.1194) by 0.003 BPB.
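The quick-convergence claim in finding 4 is easy to reproduce on synthetic data: the normalized Hessian diagonal (relative column importance) estimated from a few thousand i.i.d. samples already matches the estimate from 100× more samples. Synthetic Gaussian activations stand in for real ones here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
A = rng.normal(size=(d, d))            # fixed mixing matrix = "model weights"

def relative_importance(n):
    """Normalized diagonal of H = X^T X / n from n synthetic samples."""
    X = rng.normal(size=(n, d)) @ A
    H = X.T @ X / n
    return np.diag(H) / np.diag(H).sum()

small, large = relative_importance(2_000), relative_importance(200_000)
drift = np.abs(small - large).max()    # 2K samples vs 100x more
```

The per-column importance ranking stabilizes almost immediately, consistent with the observation that 131K and 25M random calibration tokens give identical BPB.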

See our PR #756 for additional negative results (Qronos, CDQuant, TTT, Spectral Init, SLOT) on this stack.

🤖 Generated with Claude Code

…11473 (3-seed mean)

AR self-generated calibration (no val/train data during quantization).
Recreated from PR openai#728 at @valerio-oai's request for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Eppie commented Mar 28, 2026

Nice job, looks clean/valid to my non-expert eyes!

resouer pushed a commit to resouer/parameter-golf that referenced this pull request Mar 28, 2026
- AdamW TTT adapts full-precision EMA weights before GPTQ
- Score-first approach (inference_mode then train) for compliance
- Hyperparams: lr=0.0005, epochs=3, chunk=32768, cosine decay
- 3-stage timing: TTT / AR self-gen+GPTQ / final eval
- Uses _HessianGPT (non-banked) for TTT, rebanks for AR self-gen
- Kill criteria: seed=1337 must reach <= 1.1156 BPB
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Mar 28, 2026
- copy the old openai#1019 execution path into the experiment branch
- add score-first AdamW TTT on the dequantized int6 eval model
- default TTT on with 1 epoch for the narrow smoke path
- instrument ar_selfgen_gptq/final_eval/post_quant_ttt timing
brunner-concepts pushed a commit to brunner-concepts/parameter-golf that referenced this pull request Mar 29, 2026
All cache targets (openai#868, openai#913, openai#933) were closed by the organizer.
Retarget operator to PR openai#549 (accepted SOTA) and PR openai#1019.
Sync upstream code, create run specs, update policy and campaign.
Rewrite grant application for $500 development tier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer pushed a commit to resouer/parameter-golf that referenced this pull request Mar 29, 2026
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 29, 2026
…on GPTQ, loss truncation

8 experiments on PR openai#1019 stack: 1 positive (memmap −0.0033), 7 negative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
barneywohl added a commit to barneywohl/parameter-golf that referenced this pull request Mar 30, 2026
…816 (val_bpb 1.1116)

3-seed mean: 1.1116 ± 0.0005
Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky)
+ coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112)
+ fullgraph=True torch.compile

Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
@valerio-oai
Contributor

Looks good to me too, and clears the 0.005 nats bar. Merging into the leaderboard!

@abaybektursun
Contributor Author

@valerio-oai Thanks! 🥳

amrayach added a commit to amrayach/parameter-golf that referenced this pull request Mar 30, 2026
…n path

The export-only replay showed 66/66 layers worse than naive regardless of
actorder or block_size, pointing to the upstream Hessian path as the root
cause. This patch aligns Hessian collection with PR openai#1019/openai#634 semantics:

- Divide accumulated H by num_batches (was raw sum — caused scale blowup)
- Add 1% diagonal damping in _finalize_hessians before quantization
- Run calibration forward pass under torch.autocast(bf16) to match training
- Accumulate Hessians on CPU to avoid GPU memory pressure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Mar 30, 2026
Previous replay_ref_hfix still showed 66/66 worse layers with the Hessian
normalization fix. Rather than continuing to debug from symptoms, this
transplants PR openai#1019's complete GPTQ slice verbatim:

- collect_hessians: PR openai#1019 hook pattern with pre-init and param_name keys
- quantize_int6_gptq: verbatim from PR openai#1019 lines 1171-1224
- gptq_mixed_quantize_int6: direct param_name key lookup, PR openai#1019 quantizer

Source: pr-1019-gptq:records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Mar 30, 2026
Set GPTQ_AR_CALIB=1 to generate 64 autoregressive sequences (temp=0.8)
from the model itself instead of using training data for Hessian
collection. This matches PR openai#1019's actual calibration strategy.

Both paths available — training data (default) and AR self-gen (opt-in).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Mar 30, 2026
… + warmdown3500 + LeakyReLU²)

Four training-quality improvements on Session 03 anchor:
- warmdown 3000 → 3500
- XSA 4 → 11 (all layers)
- VE128 on layers 9-10 (shared ValueEmbedding, 128→256 proj)
- LeakyReLU(0.5)² replacing ReLU²

SWA excluded (dead code in PR openai#1019 and openai#634).
GPTQ excluded (parked; replay requires separate merge step).
Base: Session 03 anchor, not Session 05b.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>